Criticality-driven Token Dataflow Optimizations for FPGA-based Sparse LU Factorization
ثبت نشده
چکیده
Performance of FPGA-based token dataflow architectures is often limited by the long tail distribution of parallelism in the compute paths of dataflow graphs. This is known to limit speedup of dataflow processing of Sparse LU factorization to only 3– 10× over CPUs. In this paper, we show how to overcome these limitations by exploiting criticality information along compute paths; both statically during graph pre-processing and dynamically at runtime. We statically restructure the high-fanin dataflow chains using a technique inspired by Huffman encoding where we provide faster routes for late arriving inputs as predicted through our timing models. We also perform a fanout decomposition and selective node replication in order to distribute serialization costs across multiple PEs. This static restructuring overhead is small; roughly the cost of a single iteration, and is amortized across 1000s of LU iterations at runtime. Additionally, we modify the dataflow firing rule in hardware to prefer critical nodes when multiple nodes are ready for dataflow evaluation. We compute this criticality offline through a one-time slack analysis and implement this in hardware at virtually no cost through a trivial address encoding ordered by criticality. For dataflow graphs extracted for sparse LU factorization, we demonstrate up to 2.5× (mean 1.21×) improvement when using the static preprocessing alone, a 2.4× (mean 1.17×) improvement when using only runtime optimizations alone while an overall 2.9× (mean 1.39×) improvement when both static and runtime optimizations are enabled across a range of benchmark problems.
منابع مشابه
Criticality-driven Token Dataflow Optimizations for FPGA-based Sparse LU Factorization
Performance of FPGA-based token dataflow architectures is often limited by the long tail distribution of parallelism in the compute paths of dataflow graphs. This is known to limit speedup of dataflow processing of Sparse LU factorization to only 3– 10× over CPUs. In this paper, we show how to overcome these limitations by exploiting criticality information along compute paths; both statically ...
متن کاملTag Management in a Reconfigurable Tagged-Token Dataflow Architecture
Combining dataflow concepts with reconfigurable computing provides a great potential to exploit the application parallelism efficiently. However, to express such parallelism cannot be a trivial task. Therefore, there is a great effort to automatically translate programs originally written in procedural languages (like C and Java) into dataflow architectures which express the parallelism in a na...
متن کاملOut-of-Order Dataflow Scheduling for FPGA Overlays
We exploit floating-point DSPs in the Arria10 FPGA and multi-pumping feature of the M20K RAMs to build a dataflow-driven soft processor fabric for large graph workloads. In this paper, we introduce the idea of out-of-order node scheduling across a large number of local nodes (thousands) per processor by combining an efficient node tagging scheme along with leading-one detector circuits. We use ...
متن کاملAn NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads
Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Su...
متن کاملParallel Direct Solution of Linear Equations on FPGA-Based Machines
The efficient solution of large systems of linear equations represented by sparse matrices appears in many tasks. LU factorization followed by backward and forward substitutions is widely used for this purpose. Parallel implementations of this computation-intensive process are limited primarily to supercomputers. New generations of Field-Programmable Gate Array (FPGA) technologies enable the im...
متن کامل